Background and Motivation

Reddit, known as the “Front Page of the Internet”, is a popular forum, especially among young people, where users can post about almost anything. Unlike on other social media platforms, the majority of Reddit users remain anonymous. We believe that this anonymity makes the forum a suitable source of text for training and testing NLP models. Reddit has a large international community and a lot of programming-related content. We want to use data from Reddit to better understand the popularity of programming languages among its users.

Additionally, we want to compare it with data from Stack Overflow, a forum that focuses more on solving programming-related problems.

We want to evaluate which programming languages are discussed on both forums and compare them. To reach this aim, we want to use visualization and machine learning methods based on text data, and also use the quantitative signals that we get from upvotes and the number of comments.

Research Questions

  • Can we decide which topic/language a certain post is about?
  • How do the number of upvotes, the number of comments and the number of posts correlate with popularity?
  • How does the popularity of programming languages change over time?
  • Can we predict the popularity of programming languages in the future?
  • How do the two platforms compare based on programming languages?

Design overview

For the Reddit posts, the plan is to use an API for Reddit data, such as the Pushshift API, to retrieve data sets for a certain time range and a number of specific subreddits. The choice of subreddits is crucial for the quality and expressiveness of our data and will be based on prior research into programming-related subreddits. From this data we can then obtain the subreddit, title, text, upvotes and various metadata.

Data set example

data.subreddit  data.id  data.created  data.created_utc  data.upvote_ratio  data.ups  data.score  data.num_comments  data.title
coding          nh0yzf   1621547972    1621519172        0.33               0         0           1                  Back-End VS Front-End Framework | 6 J.S. Frameworks Experts Love - Untied Blogs
coding          ngzeep   1621543958    1621515158        0.50               0         0           0                  File Descriptor Limits
coding          ngy73c   1621540423    1621511623        0.93               21        21          2                  Introduction to Continuous Profiling
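
The dotted column names in the example (data.subreddit, data.title, ...) look like the result of flattening nested JSON into one row per post. The following is a minimal sketch of such a flattening step, assuming each post arrives as a dictionary with a nested "data" object. The helper name flatten and the shape of the sample record are illustrative assumptions, not part of any API; the values are taken from the third example row.

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects and extend the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Illustrative post record (assumed structure); values from the example rows.
post = {
    "kind": "t3",
    "data": {
        "subreddit": "coding",
        "title": "Introduction to Continuous Profiling",
        "id": "ngy73c",
        "created_utc": 1621511623,
        "upvote_ratio": 0.93,
        "score": 21,
        "num_comments": 2,
    },
}

row = flatten(post)
print(row["data.title"], row["data.score"])
```

Each flattened dictionary can then serve as one row of the data set, with the dotted keys as column names.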

Data Preprocessing

  • Fetching the Reddit data from the Pushshift API
  • Finding relevant subreddits
  • Selecting the features that should be used
  • Text preprocessing:
    • Removing punctuation and stopwords
    • Stemming
    • Vectorization
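
The text preprocessing steps listed above can be sketched with the Python standard library alone. This is a simplified illustration, not the project's final pipeline: the stopword list is a tiny placeholder, simple_stem is a crude suffix stripper standing in for a real stemmer (for example Porter's), and the vectorization is a plain bag-of-words count. All function names are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "to", "of", "for", "with", "on"}


def simple_stem(token):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation and stopwords, then stem each token."""
    # '+' and '#' are kept so that names like "c++" and "c#" survive.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOPWORDS]


def vectorize(documents):
    """Build a shared vocabulary and one term-count vector per document."""
    tokenized = [preprocess(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


titles = ["Introduction to Continuous Profiling", "File Descriptor Limits"]
vocabulary, vectors = vectorize(titles)
print(vocabulary)
print(vectors)
```

The tokenizer deliberately keeps "+" and "#", since stripping all punctuation would collapse language names such as "C++" and "C#" into "c".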

Exploratory Data Analysis and Visualization

  • Exploratory data analysis using box plots, histograms, scatter plots, etc. of upvotes, number of comments and number of posts for every programming language
  • Showing the trend with a line plot and confidence interval
  • Using a 2D embedding of the posts and a scatter plot to show similarity
  • Comparing Stack Overflow and Reddit in the plots
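
For the trend line with a confidence interval, the values to plot could be computed as in the following sketch. It assumes posts are available as (month, upvotes) pairs and uses a normal-approximation interval (mean plus or minus 1.96 standard errors), which is only one of several possible choices. The function name monthly_trend and the toy numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def monthly_trend(posts, z=1.96):
    """Return (month, mean, lower, upper) tuples sorted by month.

    The interval is a normal approximation: mean +/- z * standard error.
    """
    by_month = defaultdict(list)
    for month, upvotes in posts:
        by_month[month].append(upvotes)

    trend = []
    for month in sorted(by_month):
        values = by_month[month]
        centre = mean(values)
        if len(values) > 1:
            half_width = z * stdev(values) / len(values) ** 0.5
        else:
            # With a single observation the spread is undefined; use a zero-width band.
            half_width = 0.0
        trend.append((month, centre, centre - half_width, centre + half_width))
    return trend


# Made-up (month, upvotes) pairs, only to show the shape of the output.
toy_posts = [("2021-04", 10), ("2021-04", 14), ("2021-04", 12),
             ("2021-05", 21), ("2021-05", 19)]
for month, centre, lower, upper in monthly_trend(toy_posts):
    print(f"{month}: {centre:.1f} [{lower:.1f}, {upper:.1f}]")
```

The mean column gives the line, and the lower and upper columns give the shaded band around it.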

Classification of topics

  • Topic modeling with Latent Dirichlet Allocation (LDA)
  • Using decision trees or ensemble methods
  • Using sentiment analysis and clustering to evaluate if a post is a professional question or an opinion
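
To illustrate the decision tree idea on text, the sketch below implements only the core step of tree learning: choosing the single word whose presence or absence best separates the labels, measured by weighted Gini impurity. A full decision tree applies this step recursively, and in practice a library implementation would be used. The function names and the hand-labelled toy posts are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(documents, labels):
    """Find the word whose presence/absence split has the lowest weighted Gini impurity."""
    vocabulary = sorted({word for doc in documents for word in doc})
    best_word, best_score = None, float("inf")
    for word in vocabulary:
        present = [lab for doc, lab in zip(documents, labels) if word in doc]
        absent = [lab for doc, lab in zip(documents, labels) if word not in doc]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(labels)
        # Strict comparison: on ties the alphabetically first word is kept.
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Made-up, already tokenized posts with hand-assigned language labels.
docs = [
    {"pandas", "dataframe", "error"},
    {"pandas", "plot"},
    {"borrow", "checker", "error"},
    {"cargo", "borrow"},
]
labels = ["python", "python", "rust", "rust"]

word, impurity = best_split(docs, labels)
print(word, impurity)
```

In this toy data both "borrow" and "pandas" separate the two classes perfectly, whereas a word like "error" appears in both and is a poor split.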

Prediction of future popularity

  • ARIMA
  • SARIMA
  • Exponential Smoothing
  • Random Forest (based on CART)
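
Of the methods listed, exponential smoothing is the simplest to write out. The sketch below shows simple (single) exponential smoothing, where each level is a weighted average of the newest observation and the previous level. It assumes a series of monthly post counts for one language; the numbers and the smoothing factor alpha = 0.5 are made up for illustration.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed levels s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]  # initialise the level with the first observation
    for value in series[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels


def forecast(series, alpha, horizon):
    """Simple exponential smoothing has a flat forecast: the last level, repeated."""
    last_level = exponential_smoothing(series, alpha)[-1]
    return [last_level] * horizon


# Made-up monthly post counts for one language.
monthly_posts = [100, 120, 110, 130]
print(forecast(monthly_posts, alpha=0.5, horizon=3))
```

This basic variant models neither trend nor seasonality, so its forecast is constant over the horizon; capturing those patterns is where extended smoothing variants and models such as SARIMA come in.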

Shiny application

  • Using Shiny as a tool for interactive exploration of the data set
    • Sliders for selecting a time range of interest
    • Checkboxes or dropdown menus to select different programming languages
  • Bonus: Shiny can refresh the data sets on a web page, making it possible to get new insights every day

Time plan